CODESYS String Libraries
Introduction
The libraries in the CODESYS String Libraries package can be used to process UTF-8 encoded strings. The basis is the IString interface from the String Segments library. Using this interface, strings can be passed to the respective functions by reference. To create an IString instance, for example, the Generic String Base library provides the GSB.UTF8String function block.
UTF-8 Encoding Support: Base function for handling UTF-8 encoded memory areas
String Builder: Efficient management of UTF-8 encoded string segments
String Segments: Base functions for IString instances
String Conversions: Conversion of strings of different encoding to/from UTF-8
String Functions: Functions for processing UTF-8 encoded strings following the example of the conventional standard library.
Unicode Support: Functions for processing UNICODE character categories.
Generic String Base: Function blocks for processing UTF-8 encoded strings which manage their memory statically via GENERIC CONSTANT.
Advantages of the new string libraries
The new string libraries can also handle large strings efficiently. For this reason, the libraries are also suitable for editing large text files and web contents.
UTF-8 is an encoding that can represent the full range of Unicode characters.
UTF-8 is widely used on the Internet and is recommended by the World Wide Web Consortium (W3C).
UTF-8 is compatible with legacy systems because of ASCII compatibility.
UTF-8 provides a high level of interoperability.
UTF-8 is memory-efficient: ASCII characters occupy one byte each, and longer sequences are used only for characters that need them.
The new string libraries let you query a previously defined string by means of methods, as you know it from other high-level languages.
udiStringLen := myString.Len();
IF udiStringLen = 22 THEN
...
Note
The new string libraries do not replace the familiar string functions of the Standard and Standard64 libraries. Nevertheless, we recommend using the new string libraries for new projects.
As of CODESYS 3.5.18.0, you can set the compiler to interpret the contents of variables of type STRING as UTF-8 encoded. To do this, select the UTF8 Encoding for STRING option in the Project Settings, in the Compile options category.
If you do not want to treat all STRING variables in a project as UTF-8 encoded, then you need to clear this option. After that, you can apply UTF-8 encoding to individual literals of the STRING type on a case-by-case basis.
{attribute 'monitoring_encoding' := 'UTF-8'}
sValue : STRING(140) := UTF8#'Ðα ṧтℯ♄ ḯḉℌ ηuη, i¢ℌ αямℯґ 𝕋øґ‼ Ṳᾔⅾ ♭ḯη $☺ ḱℓυℊ αł$ ωⅈ℮ ẕυ√◎ґ';
Thanks to the capabilities of UTF-8 encoding, you do not have to use the WSTRING data type in CODESYS to use an extended character set. UCS-2 encoding, which WSTRING is based on, may require more memory than UTF-8 encoding, depending on the application. UCS-2 encoding always uses one WORD per character and can represent only the characters U+0000 to U+D7FF and U+E000 to U+FFFF (the range U+D800 to U+DFFF is reserved for surrogates). UTF-8 encoding requires between one and four bytes per character and can therefore represent all Unicode characters.
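The "one to four bytes per character" rule can be made concrete with a small sketch. The function below is a hypothetical helper (its name, UTF8EncodedLengthSketch, is an assumption and not part of the CODESYS string libraries). It returns the number of bytes that UTF-8 needs for a given code point, using the standard UTF-8 range boundaries.

```iecst
// Hypothetical helper for illustration only; not part of the CODESYS string libraries.
FUNCTION UTF8EncodedLengthSketch : UDINT
VAR_INPUT
    diCodePoint : DINT; // Unicode code point to examine
END_VAR

IF (diCodePoint < 0) OR (diCodePoint > 16#10FFFF) THEN
    UTF8EncodedLengthSketch := 0; // outside the Unicode range
ELSIF (diCodePoint >= 16#D800) AND (diCodePoint <= 16#DFFF) THEN
    UTF8EncodedLengthSketch := 0; // surrogates are not valid code points in UTF-8
ELSIF diCodePoint <= 16#7F THEN
    UTF8EncodedLengthSketch := 1; // ASCII
ELSIF diCodePoint <= 16#7FF THEN
    UTF8EncodedLengthSketch := 2;
ELSIF diCodePoint <= 16#FFFF THEN
    UTF8EncodedLengthSketch := 3; // rest of the range that UCS-2 can also represent
ELSE
    UTF8EncodedLengthSketch := 4; // only reachable with UTF-8, not with UCS-2
END_IF
```

Compared with the fixed two bytes of UCS-2, text that is mostly ASCII needs about half the memory in UTF-8, whereas characters in the three-byte range need one byte more.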
With UTF-8 encoding, characters have a variable length in bytes, so a byte index does not correspond to a character position. If you try to get a specific character using a fixed index, then this leads to unexpected results.
byValue := sValue[13]; // The byte at index 13 is NOT the 'u'
xOk := byValue <> 16#75;
You need to determine the index of a character by iterating through the string.
VAR
    udiIndex, udiLength : UDINT;
    diRune : UTF8.RUNE;
    xOk : BOOL;
END_VAR
WHILE (diRune := TO_DINT(sValue[udiIndex])) <> 0 DO
    IF diRune > 16#7F THEN
        // Multi-byte sequence: decode it and get its length in bytes
        diRune := UTF8.DecodeRune(ADR(sValue[udiIndex]), 4, udiLength=>udiLength);
    ELSE
        // UTF-8 encodes every ASCII character (0-127) in one byte
        udiLength := 1;
    END_IF
    IF diRune = 16#75 THEN
        EXIT; // 'u' found, udiIndex is its byte index
    END_IF
    udiIndex := udiIndex + udiLength;
END_WHILE
xOk := sValue[udiIndex] = 16#75;
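The reverse question, which byte index belongs to the n-th character, can be sketched without any library call. The sketch relies only on the UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx. The variable names and the target position 13 are assumptions for illustration. sValue is the UTF-8 encoded STRING declared earlier, and the sketch assumes that it contains valid UTF-8.

```iecst
VAR
    udiByteIndex : UDINT := 0;  // current byte position in sValue
    udiCharIndex : UDINT := 0;  // zero-based index of the next character start
    udiTarget : UDINT := 13;    // hypothetical: zero-based character position to locate
    byCurrent : BYTE;
    xFound : BOOL;
END_VAR

byCurrent := sValue[udiByteIndex];
WHILE byCurrent <> 0 DO
    // Every byte that is not a continuation byte (10xxxxxx) starts a new character
    IF (byCurrent AND 16#C0) <> 16#80 THEN
        IF udiCharIndex = udiTarget THEN
            EXIT; // udiByteIndex now addresses the first byte of the character
        END_IF
        udiCharIndex := udiCharIndex + 1;
    END_IF
    udiByteIndex := udiByteIndex + 1;
    byCurrent := sValue[udiByteIndex];
END_WHILE
xFound := byCurrent <> 0; // FALSE if the string has fewer characters than requested
```

Because each lookup walks the string from the beginning, repeated random access by character position is expensive. Where possible, process the string sequentially.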
Disadvantages of the established STRING functions
In the previously established STRING functions from the standard library, the parameters of type STRING are copied when they are passed to the functions. The return value is also copied to a variable with the assignment.
VAR
sValue : STRING;
END_VAR
sValue := CONCAT(CONCAT(CONCAT('Da steh ich nun,', ' ich armer Tor!'), ' Und bin so'), ' klug als wie zu vor');
// -> Copy, LEN -> Copy, LEN -> Copy, LEN -> Copy, LEN
// -> 2xCopy, LEN
// -> 2xCopy, LEN
// -> 2xCopy, LEN
Before the parameters of type STRING can be processed in the respective functions, their lengths often have to be determined by iterating up to the terminating null character. For longer strings, these copy and iteration operations increase the processing time of the application. In addition, these functions are limited to strings of at most 255 characters.
Using the IString interface
The STR.IString interface was introduced so that the data structure which manages the information about a string can be passed by reference. This is a major difference from the previously established STRING functions, which do not use the STR.IString interface.
Furthermore, the size of a string (the memory for the UTF-8 encoded characters) may be anywhere in the numeric range of UDINT (4 ≦ udiSize ≦ 16#FFFFFF).
Reference to the respective memory segment
Current capacity (→ GetSegment)
Length (→ Len) in bytes
Number of "characters" (→ RuneCount)
VAR
    itfString : STR.IString;
    udiLength, udiSize, udiRuneCount : UDINT;
    pbySegment : POINTER TO BYTE;
    xValid : BOOL;
END_VAR
udiLength := itfString.Len(); // Current length in bytes
pbySegment := itfString.GetSegment(udiSize=>udiSize); // Address of the first byte, capacity of the segment in bytes
udiRuneCount := STR.RuneCount(itfString); // Current number of "characters" in the segment
xValid := itfString.IsValid(); // Indicates whether the segment contains valid UTF-8 encoding
Correlation: "character" and "rune"
The term "rune" appears in the libraries and in the source code. It means exactly the same as "Unicode code point", with one addition.
The libraries define the word "rune" as an alias for the type DINT. As a result, the user can clearly see when an integer value represents a code point. Moreover, what might be thought of as a character constant is called a rune constant.
Example: The type and value of the expression WSTRING#"⌘" is a rune with the integer value DINT#16#2318.
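To show how a rune relates to the bytes stored in a UTF-8 string, the following sketch encodes a code point by hand. EncodeRuneSketch and its parameters are hypothetical names chosen for illustration. The sketch uses only the standard UTF-8 bit layout and standard conversion and shift operators, and it is not a function of the CODESYS string libraries.

```iecst
// Hypothetical helper for illustration only; not part of the CODESYS string libraries.
FUNCTION EncodeRuneSketch : UDINT // number of bytes written, 0 if the rune is invalid
VAR_INPUT
    diRune : DINT; // code point to encode
END_VAR
VAR_IN_OUT
    abyOut : ARRAY[0..3] OF BYTE; // receives the UTF-8 bytes
END_VAR
VAR
    dwRune : DWORD;
END_VAR

EncodeRuneSketch := 0;
IF (diRune < 0) OR (diRune > 16#10FFFF)
    OR ((diRune >= 16#D800) AND (diRune <= 16#DFFF)) THEN
    RETURN; // outside the Unicode range or a surrogate
END_IF
dwRune := DINT_TO_DWORD(diRune);
IF dwRune <= 16#7F THEN
    abyOut[0] := DWORD_TO_BYTE(dwRune);                             // 0xxxxxxx
    EncodeRuneSketch := 1;
ELSIF dwRune <= 16#7FF THEN
    abyOut[0] := 16#C0 OR DWORD_TO_BYTE(SHR(dwRune, 6));            // 110xxxxx
    abyOut[1] := 16#80 OR DWORD_TO_BYTE(dwRune AND 16#3F);          // 10xxxxxx
    EncodeRuneSketch := 2;
ELSIF dwRune <= 16#FFFF THEN
    abyOut[0] := 16#E0 OR DWORD_TO_BYTE(SHR(dwRune, 12));           // 1110xxxx
    abyOut[1] := 16#80 OR DWORD_TO_BYTE(SHR(dwRune, 6) AND 16#3F);
    abyOut[2] := 16#80 OR DWORD_TO_BYTE(dwRune AND 16#3F);
    EncodeRuneSketch := 3;
ELSE
    abyOut[0] := 16#F0 OR DWORD_TO_BYTE(SHR(dwRune, 18));           // 11110xxx
    abyOut[1] := 16#80 OR DWORD_TO_BYTE(SHR(dwRune, 12) AND 16#3F);
    abyOut[2] := 16#80 OR DWORD_TO_BYTE(SHR(dwRune, 6) AND 16#3F);
    abyOut[3] := 16#80 OR DWORD_TO_BYTE(dwRune AND 16#3F);
    EncodeRuneSketch := 4;
END_IF
```

For the rune 16#2318 from the example above, this sketch writes the three bytes 16#E2, 16#8C, 16#98 and returns 3.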